If we want to identify groups or clusters using expression data, we often want to validate those groups in some way, such as determining the “correct” number of groups or checking for agreement within clusters.
In this notebook, we’ll use the medulloblastoma data from the OpenPBTA project to look at two techniques for cluster validation and look how clusters overlap with sample labels that are available in our metadata.
# Bit o' data wranglin'
library(tidyverse)
Registered S3 method overwritten by 'dplyr':
method from
print.rowwise_df
Registered S3 methods overwritten by 'dbplyr':
method from
print.tbl_lazy
print.tbl_sql
[37m── [1mAttaching packages[22m ─────────────────────────────────────────────── tidyverse 1.3.0 ──[39m
[37m[32m✓[37m [34mggplot2[37m 3.3.0 [32m✓[37m [34mpurrr [37m 0.3.4
[32m✓[37m [34mtibble [37m 3.0.1 [32m✓[37m [34mdplyr [37m 0.8.3
[32m✓[37m [34mtidyr [37m 1.0.0 [32m✓[37m [34mstringr[37m 1.4.0
[32m✓[37m [34mreadr [37m 1.3.1 [32m✓[37m [34mforcats[37m 0.4.0[39m
package ‘readr’ was built under R version 3.6.2package ‘dplyr’ was built under R version 3.6.2[37m── [1mConflicts[22m ────────────────────────────────────────────────── tidyverse_conflicts() ──
[31mx[37m [34mdplyr[37m::[32mfilter()[37m masks [34mstats[37m::filter()
[31mx[37m [34mdplyr[37m::[32mlag()[37m masks [34mstats[37m::lag()[39m
# Consensus clustering library
library(ConsensusClusterPlus)
# Library we'll use for silhouette values
library(cluster)
library(ComplexHeatmap)
Loading required package: grid
========================================
ComplexHeatmap version 2.2.0
Bioconductor page: http://bioconductor.org/packages/ComplexHeatmap/
Github page: https://github.com/jokergoo/ComplexHeatmap
Documentation: http://jokergoo.github.io/ComplexHeatmap-reference
If you use it in published research, please cite:
Gu, Z. Complex heatmaps reveal patterns and correlations in multidimensional
genomic data. Bioinformatics 2016.
========================================
# We'll make a 3D plot with this library
library(plotly) # TODO: install globally
Registered S3 method overwritten by 'data.table':
method from
print.data.table
Registered S3 methods overwritten by 'htmltools':
method from
print.html tools:rstudio
print.shiny.tag tools:rstudio
print.shiny.tag.list tools:rstudio
Registered S3 method overwritten by 'htmlwidgets':
method from
print.htmlwidget tools:rstudio
Attaching package: ‘plotly’
The following object is masked from ‘package:ComplexHeatmap’:
add_heatmap
The following object is masked from ‘package:ggplot2’:
last_plot
The following object is masked from ‘package:stats’:
filter
The following object is masked from ‘package:graphics’:
layout
# Directory with the prepared/lightly cleaned OpenPBTA data
data_dir <- file.path("data", "open-pbta", "processed")
# Create a directory to hold our cluster validation results if it doesn't
# exist yet
results_dir <- "results"
if (!dir.exists(results_dir)) {
dir.create(results_dir, recursive = TRUE)
}
histologies_file <- file.path(data_dir, "pbta-histologies-stranded-rnaseq.tsv")
rnaseq_file <- file.path(data_dir, "pbta-vst-stranded.tsv")
cc_results_file <- file.path(results_dir, "consensus_clustering_results.RDS")
Let’s read in the sample metadata and get the sample identifiers for medulloblastoma samples.
# Read in metadata TSV
histologies_df <- read_tsv(histologies_file)
Parsed with column specification:
cols(
.default = col_character(),
OS_days = [32mcol_double()[39m,
age_last_update_days = [32mcol_double()[39m
)
See spec(...) for full column specifications.
# Only pull out sample identifiers (KidsFirst biospecimen identifiers) that
# correspond to medulloblastoma samples
medulloblastoma_samples <- histologies_df %>%
filter(short_histology == "Medulloblastoma") %>%
pull(Kids_First_Biospecimen_ID)
# Read in transformed RNA-seq data
rnaseq_df <- read_tsv(rnaseq_file)
Parsed with column specification:
cols(
.default = col_double(),
gene_id = [31mcol_character()[39m
)
See spec(...) for full column specifications.
# For our clustering validation analyses, we want a matrix of only
# medulloblastoma samples where the gene identifiers are rownames rather than
# in the first column
rnaseq_mat <- rnaseq_df %>%
# This makes sure we retain all of the columns with biospecimen IDs that
# correspond to medulloblastoma samples
select(gene_id, all_of(medulloblastoma_samples)) %>%
tibble::column_to_rownames("gene_id") %>%
as.matrix()
The first method we’ll use is called consensus clustering. Consensus clustering to finds the “consensus” across multiple runs of the algorithm using a resampling procedure.
We’ll use the package ConsensusClusterPlus that we loaded up top.
The consensus clustering methodology was first introduced in Monti et al. Machine Learning. 2003.
Consensus clustering is one way to help you determine the number of clusters in your data, but it is not the only methodology available. Check out the Data Novia course Cluster Validation Essentials by Alboukadel Kassambara for a deeper dive.
cc_results <- ConsensusClusterPlus(rnaseq_mat,
maxK = 15,
# Setting this seed is necessary for the
# results to be reproducible
seed = 2020,
innerLinkage = "average",
finalLinkage = "average",
distance = "pearson")
end fraction
It seems like there are 3 main stable clusters from the consensus clustering results. A cluster of a few samples may not be that helpful in reaching our analysis goals.
Let’s take a look at the class labels for k = 9.
# table() creates a contingency table of counts
#
table(cc_results[[9]]$consensusClass)
1 2 3 4 5 6 7 8 9
76 1 25 11 2 2 2 1 1
(Note: the numbering of the clusters is arbitrary here.)
Let’s consider clusters 1, 3, and 4 for further analysis. But first, we’ll save the entirety of the consensus clustering results to file.
# Write consensus clustering results to file
write_rds(cc_results, path = cc_results_file)
And now to extract the samples in the clusters of interest. What we used table() on before is actually a named vector.
cc_cluster_labels <- cc_results[[9]]$consensusClass
head(cc_cluster_labels)
BS_09Z7TC35 BS_1AYRM596 BS_1BWP5MCT BS_1QXEC43H BS_1TWCV047 BS_2SFTWNVE
1 1 2 3 1 1
We’ll extract the names (biospecimen IDs) for samples in clusters 1, 3, and 4, which contain the majority of samples.
sample_index <- which(cc_cluster_labels %in% c(1, 3, 4))
samples_in_clusters <- names(cc_cluster_labels)[sample_index]
Check what proportion of total samples are in one of these three clusters.
# number of samples that met our logical criterion above divided by total
# number of samples
length(sample_index) / length(cc_cluster_labels)
[1] 0.9256198
And now we’ll filter the RNA-seq matrix.
filtered_rnaseq_mat <- rnaseq_mat[, samples_in_clusters]
dim(filtered_rnaseq_mat)
[1] 50328 112
The silhouette coefficient or value is a measure of cluster consistency that ranges from -1 to 1. It is calculated on a per-sample basis – it measures how similar a sample is to the cluster it’s in compared to other clusters (Wikipedia entry for silhouette).
We’ll use the cluster package to compute our silhouette values for the clusters we identified using consensus clustering above.
The silhouette() function takes the class labels and dissimilarities. Let’s use Pearson correlation as we have been throughout.
all.equal(names(cc_cluster_labels[sample_index]),
colnames(filtered_rnaseq_mat))
[1] TRUE
# Calculate the Pearson correlation between samples (columns of the matrix)
mb_sample_correlation <- cor(filtered_rnaseq_mat,
use = "pairwise.complete.obs",
method = "pearson")
# Dissimilarity means we need to subtract these values from 1
mb_sample_dist <- as.dist(1 - mb_sample_correlation)
Let’s calculate the silhouette values and then plot them.
silhouette_results <- silhouette(x = cc_cluster_labels[sample_index],
dist = mb_sample_dist)
plot(silhouette_results)
There are multiple medulloblastoma molecular subtypes and this classification largely relies on gene expression data. A medulloblastoma subtype classifier, which is an example of supervised machine learning, has been applied to the medulloblastoma samples included in OpenPBTA. How do the subtype labels from this classifier (in the molecular_subtype column of our sample metadata) stack up to the clusters we identified with unsupervised methods?
Let’s first make a data frame that holds the subtype labels. We can use this both to compare our unsupervised clustering results to the subtype labels and for some plotting downstream.
mb_molecular_subtype_df <- histologies_df %>%
filter(short_histology == "Medulloblastoma") %>%
select(Kids_First_Biospecimen_ID, molecular_subtype)
Add the consensus clustering labels.
# Create a data frame that contains the consensus cluster results and join
# it with the data frame of molecular subtype labels
cc_df <- data.frame(cc_cluster_labels) %>%
tibble::rownames_to_column("Kids_First_Biospecimen_ID") %>%
inner_join(mb_molecular_subtype_df)
Joining, by = "Kids_First_Biospecimen_ID"
Do the consensus clustering results agree with the molecular subtype labels?
table(cc_df$cc_cluster_labels, cc_df$molecular_subtype)
Group3 Group4 SHH WNT
1 13 55 6 2
2 1 0 0 0
3 0 7 18 0
4 1 2 0 8
5 0 0 1 1
6 0 0 2 0
7 0 2 0 0
8 0 0 1 0
9 0 1 0 0
Hm… there’s some agreement with the subtype labels but it’s not perfect. Why might that be? Let’s start by looking at the overview figure for the medulloblastoma classifier.
We used a different measure (VST values vs. FPKM) and used all features.
Let’s see if low-dimensional representations agree with our other unsupervised results. First, let’s find the high variance genes; we’ll perform dimension reduction on those genes only.
# Calculate variance
gene_variance <- matrixStats::rowVars(rnaseq_mat)
# Find the value that we'll use as a threshold to filter the top 5%
variance_threshold <- quantile(gene_variance, 0.95)
# Row indices of high variance genes
high_variance_index <- which(gene_variance > variance_threshold)
First, let’s use UMAP.
# Set seed for reproducible UMAP results
set.seed(2020)
# umap() expects features (genes) to be columns, so we have to use t()
umap_results <- umap::umap(t(rnaseq_mat[high_variance_index, ]))
The UMAP coordinates are in the layout element of the list returned by umap::umap().
# Make a data frame of the layout results and join with molecular subtype
# data frame
umap_plot_df <- data.frame(umap_results$layout) %>%
tibble::rownames_to_column("Kids_First_Biospecimen_ID") %>%
inner_join(mb_molecular_subtype_df)
Joining, by = "Kids_First_Biospecimen_ID"
Let’s make a scatter plot and color our samples by the subtype labels.
umap_plot_df %>%
ggplot(aes(x = X1,
y = X2,
color = molecular_subtype)) +
geom_point() +
colorblindr::scale_color_OkabeIto() +
theme_bw() +
xlab("UMAP1") +
ylab("UMAP2")
Now, we’ll briefly show you how to use built-in functions for PCA on any matrix that’s not in a specialized object for some kind of genomic data.
# Like umap(), prcomp() expects the features to be columns
pca_results <- prcomp(t(rnaseq_mat[high_variance_index, ]),
scale = TRUE)
# The low-dimensional representation is returned in x
pca_results$x[1:6, 1:6]
PC1 PC2 PC3 PC4 PC5 PC6
BS_09Z7TC35 -10.305658 -30.486253 21.016270 -1.004493 11.720278 -8.9093995
BS_1AYRM596 -12.936747 4.858999 7.484443 -9.923455 1.800493 14.2377837
BS_1BWP5MCT -3.198626 -37.832010 9.673423 11.902507 7.611624 -11.4215716
BS_1QXEC43H 37.056562 13.450708 7.197838 1.012471 5.170360 -5.4456082
BS_1TWCV047 -17.895880 -13.279184 1.373011 14.065867 6.912049 -0.1576687
BS_2SFTWNVE -15.486259 21.313599 -14.310736 13.851413 -12.197892 3.5925658
Let’s get the first 10 PCs ready for plotting.
pca_plot_df <- data.frame(pca_results$x[, 1:10]) %>%
tibble::rownames_to_column("Kids_First_Biospecimen_ID") %>%
inner_join(mb_molecular_subtype_df)
Joining, by = "Kids_First_Biospecimen_ID"
And now make a scatterplot.
pca_plot_df %>%
ggplot(aes(x = PC1,
y = PC2,
color = molecular_subtype)) +
geom_point() +
colorblindr::scale_color_OkabeIto() +
theme_bw() +
xlab("PC1") +
ylab("PC2")
plotly is a package that allows us to plot interactive 3D scatterplots.
pca_3d_plot <- plot_ly() %>%
add_trace(data = pca_plot_df,
x = ~ PC1,
y = ~ PC2,
z = ~ PC3,
mode = "markers",
type = "scatter3d",
color = ~ molecular_subtype,
colors = colorblindr::palette_OkabeIto[1:4])
pca_3d_plot <- pca_3d_plot %>%
layout(scene = list(
xaxis = list(title = "PC1"),
yaxis = list(title = "PC2"),
zaxis = list(title = "PC3")
))
pca_3d_plot
sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.3 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/atlas/libblas.so.3.10.3
LAPACK: /usr/lib/x86_64-linux-gnu/atlas/liblapack.so.3.10.3
locale:
[1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
[4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
[7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] grid stats graphics grDevices utils datasets methods base
other attached packages:
[1] plotly_4.9.2.1 ComplexHeatmap_2.2.0 cluster_2.1.0
[4] ConsensusClusterPlus_1.50.0 forcats_0.4.0 stringr_1.4.0
[7] dplyr_0.8.3 purrr_0.3.4 readr_1.3.1
[10] tidyr_1.0.0 tibble_3.0.1 ggplot2_3.3.0
[13] tidyverse_1.3.0
loaded via a namespace (and not attached):
[1] Biobase_2.46.0 httr_1.4.1 jsonlite_1.6.1 viridisLite_0.3.0
[5] modelr_0.1.5 assertthat_0.2.1 askpass_1.1 cellranger_1.1.0
[9] yaml_2.2.1 pillar_1.4.4 backports_1.1.7 lattice_0.20-38
[13] glue_1.4.1 reticulate_1.14 digest_0.6.25 RColorBrewer_1.1-2
[17] rvest_0.3.5 colorspace_1.4-1 htmltools_0.4.0 Matrix_1.2-17
[21] pkgconfig_2.0.3 GetoptLong_0.1.8 broom_0.5.6 haven_2.2.0
[25] scales_1.1.0 RSpectra_0.16-0 openssl_1.4.1 farver_2.0.3
[29] generics_0.0.2 ellipsis_0.3.1 withr_2.2.0 umap_0.2.6.0
[33] BiocGenerics_0.32.0 lazyeval_0.2.2 cli_2.0.2 magrittr_1.5
[37] crayon_1.3.4 readxl_1.3.1 evaluate_0.14 fs_1.4.1
[41] fansi_0.4.1 nlme_3.1-140 xml2_1.3.1 tools_3.6.1
[45] data.table_1.12.8 hms_0.5.3 GlobalOptions_0.1.1 lifecycle_0.2.0
[49] matrixStats_0.55.0 munsell_0.5.0 reprex_0.3.0 compiler_3.6.1
[53] rlang_0.4.6 rstudioapi_0.11 rjson_0.2.20 htmlwidgets_1.5.1
[57] circlize_0.4.8 crosstalk_1.1.0.1 base64enc_0.1-3 rmarkdown_2.1
[61] labeling_0.3 colorblindr_0.1.0 gtable_0.3.0 DBI_1.1.0
[65] R6_2.4.1 lubridate_1.7.4 knitr_1.28 clue_0.3-57
[69] shape_1.4.4 stringi_1.4.6 parallel_3.6.1 Rcpp_1.0.4.6
[73] vctrs_0.3.1 png_0.1-7 dbplyr_1.4.2 tidyselect_1.1.0
[77] xfun_0.14